Zhang Z, Xie Y, Yang L. Photographic text-to-image synthesis with a hierarchically-nested adversarial network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6199-6208.
1. Overview
1.1. Motivation
- a fully end-to-end mapping from a low-dimensional text space to a high-resolution image space still remains unsolved
- two difficulties
- balancing the convergence between G and D
- stably modeling the huge pixel space of high-resolution images while guaranteeing semantic consistency
This paper proposes an extensible single-stream generator architecture (HDGAN)
- hierarchically-nested discriminators (trained jointly). regularize mid-level representations and assist generator training to capture complex image statistics
- multi-purpose adversarial loss
- new visual-semantic similarity measure
- single stage
- no multiple text condition
- no additional class label supervision
1.2. Dataset
- CUB birds
- Oxford-102 flowers
- MSCOCO
1.3. Related Work
1.3.1. Generative Models
- GAN
- VAE
1.3.2. Text-to-Image
- (ICML 2016) GAN
- (NIPS 2016) GAN what-where network
- (ICCV 2017) StackGAN
- (ICCV 2017) joint embedding
- perceptual loss
- auxiliary classifier
- attention-driven
1.3.3. Stability of GAN
- training techniques
- regularization using extra knowledge
- combination of G and D
As the target image resolution increases, training difficulty increases.
1.3.4. Decompose into Multiple Subtasks
- LAP-GAN
- symmetric G and D
- stage-by-stage
2. Methods
2.1. Hierarchical-nested Adversarial Objective
- G. hierarchical generator
- z. noise vector
- t. sentence embedding from a pre-trained char-RNN text encoder
- s. number of scales
- X_i (i = 1…s). generated images at gradually growing resolutions
- lower resolutions. learn semantically consistent image structure
- higher resolutions. render fine-grained details
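With the notation above, the hierarchically-nested objective can be sketched roughly as follows (reconstructed from these notes; the exact formulation in the paper may differ):

```latex
% Hierarchically-nested adversarial objective (sketch): one discriminator D_i per scale,
% all trained jointly against the single-stream generator G.
\min_{G}\;\max_{D_1,\dots,D_s}\;
\mathcal{V}\big(G,\{D_i\}\big)
= \sum_{i=1}^{s}\Big(
    \mathcal{L}_{\text{pair}}(X_i, t; D_i)
    + \mathcal{L}_{\text{image}}(X_i; D_i)
  \Big),
\qquad \{X_1,\dots,X_s\} = G(z, t)
```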
2.2. Multi-purpose Adversarial Loss
- pair loss (single score). guarantees global semantic consistency between image and text
- image loss (R_i x R_i probability map). low-resolution discriminators focus on global structures, high-resolution ones on local image details
- discriminator output. besides real matched pairs, two types of errors must be classified as fake
- real image + mismatched text
- fake image + conditioned text
2.2.1. D
2.2.2. G
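A minimal PyTorch-style sketch of the multi-purpose losses for one scale, assuming each discriminator returns a text-conditioned pair logit and an R_i x R_i local probability map (function and variable names are illustrative, not taken from the authors' code):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D_i, real_img, fake_img, text_emb, mismatched_text_emb):
    """Multi-purpose loss for one discriminator scale (sketch).

    D_i(img, txt) is assumed to return (pair_logit, local_map):
      pair_logit -- score for whether (image, text) is a real matched pair
      local_map  -- R_i x R_i logits judging local image patches
    """
    ones, zeros = torch.ones_like, torch.zeros_like
    real_pair, real_map = D_i(real_img, text_emb)
    fake_pair, fake_map = D_i(fake_img.detach(), text_emb)   # error type: fake image + conditioned text
    wrong_pair, _ = D_i(real_img, mismatched_text_emb)       # error type: real image + mismatched text

    pair_loss = (F.binary_cross_entropy_with_logits(real_pair, ones(real_pair))
                 + F.binary_cross_entropy_with_logits(fake_pair, zeros(fake_pair))
                 + F.binary_cross_entropy_with_logits(wrong_pair, zeros(wrong_pair)))
    image_loss = (F.binary_cross_entropy_with_logits(real_map, ones(real_map))
                  + F.binary_cross_entropy_with_logits(fake_map, zeros(fake_map)))
    return pair_loss + image_loss

def generator_loss(discriminators, fake_imgs, text_emb):
    """G tries to make every scale look real, both globally (pair) and locally (map)."""
    loss = 0.0
    for D_i, fake_img in zip(discriminators, fake_imgs):
        pair_logit, local_map = D_i(fake_img, text_emb)
        loss = (loss
                + F.binary_cross_entropy_with_logits(pair_logit, torch.ones_like(pair_logit))
                + F.binary_cross_entropy_with_logits(local_map, torch.ones_like(local_map)))
    return loss
```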
2.2.3. Conditioning Augmentation
Instead of directly using the deterministic text embedding, sample a stochastic conditioning vector from a Gaussian distribution parameterized by the embedding.
A Kullback-Leibler divergence regularization term is added to prevent over-fitting and enforce smoothness of the conditioning manifold.
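A minimal sketch of conditioning augmentation, assuming the text embedding is mapped to a Gaussian whose sample replaces the deterministic embedding (dimensions and module names are assumptions):

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sample a stochastic conditioning vector from N(mu(t), sigma(t)^2) instead of using t directly."""

    def __init__(self, text_dim=1024, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)  # predicts mean and log-variance

    def forward(self, text_emb):
        mu, logvar = self.fc(text_emb).chunk(2, dim=1)
        c = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        # KL(N(mu, sigma^2) || N(0, I)) regularizer: prevents over-fitting, keeps the manifold smooth
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return c, kl
```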
2.3. Architecture
2.3.1. G
- three modules
- K-repeat ResBlock. 2x (Conv + BN + ReLU)
- stretching layers. x2 nearest-neighbor upsampling + Conv-BN-ReLU
- linear compression layers. Conv-Tanh
- structure: 1-2-1-2-…
- text embedding from CA. 1024 x 4 x 4 input
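A rough PyTorch sketch of the three generator modules; layer widths, kernel sizes, and the exact residual form are assumptions, not the authors' implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """K-repeat ResBlock unit: two Conv-BN layers with ReLU and a skip connection (exact form assumed)."""

    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return F.relu(x + self.body(x))

def stretching_layer(in_ch, out_ch):
    """Stretching layer: x2 nearest-neighbor upsampling followed by Conv-BN-ReLU."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

def linear_compression(in_ch):
    """Linear compression layer: Conv-Tanh head mapping features to a 3-channel image at the current scale."""
    return nn.Sequential(nn.Conv2d(in_ch, 3, 3, padding=1), nn.Tanh())
```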
2.3.2. D
- stride-2 Conv + BN-LeakyReLU blocks
- two branches follow
- FCN branch produces an R_i x R_i probability map
- pair branch. concatenate the 512 x 4 x 4 feature map with the reduced 128 x 4 x 4 text embedding, then 1x1 Conv + 4x4 Conv
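A rough sketch of one per-scale discriminator with the two branches above; channel counts and the number of downsampling blocks (sized here for a 64 x 64 input) are assumptions:

```python
import torch
import torch.nn as nn

class ScaleDiscriminator(nn.Module):
    """Shared trunk + local (FCN) branch + text-conditioned pair branch."""

    def __init__(self, base_ch=64, text_dim=1024, reduced_text_dim=128):
        super().__init__()

        def down(cin, cout):  # stride-2 Conv + BN + LeakyReLU
            return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                                 nn.BatchNorm2d(cout), nn.LeakyReLU(0.2, inplace=True))

        # trunk sized for a 64x64 input: 64 -> 32 -> 16 -> 8 -> 4
        self.trunk = nn.Sequential(down(3, base_ch), down(base_ch, base_ch * 2),
                                   down(base_ch * 2, base_ch * 4), down(base_ch * 4, 512))
        # branch 1: fully-convolutional head, R_i x R_i probability map (taken from the last feature map here)
        self.local_head = nn.Conv2d(512, 1, 1)
        # branch 2: pair head, 512x4x4 features concatenated with the reduced 128x4x4 text embedding
        self.reduce_text = nn.Linear(text_dim, reduced_text_dim)
        self.pair_head = nn.Sequential(nn.Conv2d(512 + reduced_text_dim, 512, 1),
                                       nn.LeakyReLU(0.2, inplace=True),
                                       nn.Conv2d(512, 1, 4))  # 1x1 Conv + 4x4 Conv

    def forward(self, img, text_emb):
        feat = self.trunk(img)                                  # (B, 512, 4, 4)
        local_map = self.local_head(feat)                       # (B, 1, R_i, R_i)
        txt = self.reduce_text(text_emb)[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
        pair_logit = self.pair_head(torch.cat([feat, txt], dim=1)).view(img.size(0), -1)
        return pair_logit, local_map
```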
3. Experiments
3.1. Metrics
- Inception Score. needs a pre-trained Inception model (fine-tuned on each dataset used in this paper)
- Multi-scale Structural Similarity (MS-SSIM). pairwise similarity; a lower score indicates higher diversity of generated images
- Visual-semantic Similarity. train a visual-semantic embedding model to measure the distance between text and image
- δ. margin, set to 0.2
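A sketch of the bidirectional ranking objective commonly used to train such a visual-semantic embedding, with margin δ = 0.2; the paper's exact formulation may differ:

```latex
% c(x, y): cosine similarity between embedded image v and embedded text t;
% \bar{t}, \bar{v} range over mismatched texts / images.
\mathcal{L}_{\text{VS}} =
\sum_{(v,t)} \Big[
  \sum_{\bar{t}} \max\big(0,\; \delta - c(v,t) + c(v,\bar{t})\big)
  + \sum_{\bar{v}} \max\big(0,\; \delta - c(v,t) + c(\bar{v},t)\big)
\Big],
\qquad \delta = 0.2
```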
3.2. Comparison
- better preserves semantically consistent information at all resolutions
3.3. Style Transfer
3.4. Ablation Study
- the local image loss leads to higher performance and lets the pair loss focus more on learning semantic consistency
3.5. Sharing Top Layers of Discriminators
- no benefit observed